Abstract

Dynamic voltage and frequency scaling proves to be an efficient way of reducing energy consumption 
of servers. Energy savings are typically achieved by setting a well-chosen frequency during some program 
phases. However, determining suitable program phases and their associated optimal frequencies is a 
complex problem. Moreover, hardware is constrained by non-negligible frequency transition latencies.
Thus, various heuristics were proposed to determine and apply frequencies, but evaluating their efficiency 
remains an issue. 

In this paper, we translate the energy minimization problem into a mixed integer program that 
specifically models realistic hardware limitations. The problem solution then estimates the minimal 
energy consumption and the associated frequency schedule. The paper provides two different formulations 
and a discussion on the feasibility of each of them on realistic applications. 


1 Introduction 

For a very long time, computing performance was the only metric considered when launching a program. 
Scientists and users only cared about the time it took for a program to finish. Though still often true, the 
priority of many hardware architects and system administrators has shifted to caring more and more about 
energy consumption. Solutions reducing the energy envelope have been put forth.

Among the existing techniques, Dynamic Voltage and Frequency Scaling (DVFS) has proved to be
an efficient way to reduce processor energy consumption. The processor frequency is adapted according to 
its workload: When the frequency is lowered without increasing the execution time, the power consumption 
and energy are reduced. 

With parallel applications in general, and more precisely with MPI applications, reducing frequency on 
one processor may have a dramatic impact on the execution time of the application: Reducing processor 
frequency may delay a message sending, and maybe its reception. This may lead to cascading delays 
increasing the execution time. To save energy while respecting the application deadline, two main solutions
exist: online tools and offline scheduling. The former try to provide the frequency schedule during the 
execution whereas the latter provide it after an offline study. They both require the application task graph 
(either through a previous execution or by focusing on iterative applications). 

Many online tools [?, ?] identify the critical path (the longest path through the graph) and focus on
processors that do not execute its tasks. Typically, when waiting for a message, the processor frequency is
set to the minimal frequency until the message arrives [?]. Although online tools allow some energy savings,
these savings are suboptimal because of a lack of application knowledge.

On the other hand, offline scheduling algorithms [?, ?] provide the best execution frequency for each task.
However, none of the existing algorithms considers the characteristics of most current multi-core architectures: (i)
cores within the same processor share the same frequency [?] and (ii) switching frequency requires some
time [?].

This paper presents two models based on linear programming which find the execution frequencies of
each task while taking into account the multicore architecture constraints and characteristics
previously described (section 3). Moreover, we allow the execution time to be increased if this leads to more energy
savings. The user provides the maximum performance degradation that she can tolerate. The presented models
provide the optimal frequency schedule, which minimizes the energy consumption. However, when considering
large applications and large machines, no current solver, even a parallel one, can provide a result. The reason
behind this issue is discussed in section 3.

2 Context and execution model 

We consider MPI applications running on a multi-node platform. The targeted architectures have the
following characteristics: (i) the latency of frequency switching is not negligible and (ii) cores within the
same processor share the same frequency.

A process, running on every core, executes a set of tasks. A task, denoted $T_i$, is defined as the computations
between two communications. The application execution is represented as a task graph where tasks are
vertices and edges are messages between the tasks. Figure 1 is an example of a task graph running on two
processes. One process executes tasks $T_1$ and $T_2$ while the other one executes tasks $T_3$ and $T_4$.
Figure 1: Task graph


Before going into more details on the execution model, let us provide an example of the problem we want
to solve. Consider the example provided in Figure 2. The application is executed on 3 cores, 2 in the same
processor and one in another processor. Tasks $T_1$, $T_2$, $T_3$ and $T_4$ are executed on processor 0 while tasks $T_5$
and $T_6$ are executed on processor 1. In order to minimize the energy consumption through DVFS, we make
the same assumption as [?]: tasks may have several phases and each phase can be executed at a specific
frequency. Typically, in Figure 2, task $T_1$ is divided into 3 phases. The first one is executed at frequency $f_1$,
the second one at frequency $f_2$ and the last one at frequency $f_3$.

As stressed before, setting a frequency takes some time. In other words, when a frequency is requested,
it is not set immediately. Thus, in Figure 2, when frequency $f_2$ is requested, it is set some time later. One
needs to be careful in such situations, since a frequency may be set after the task that requested it
is over.

Moreover, cores within the same processor run at the same frequency. Hence, in Figure 2, when $f_1$ is
first set on processor 0, all the tasks being executed at this time ($T_1$ and $T_3$) are executed at frequency $f_1$. $T_5$
is not affected since it is on another processor. To provide the best frequency to execute each task portion,
we need to consider all parallel tasks which are executed at the same time on the processor.
Figure 2: Frequency switch latency (note that only the latency of the first request is represented)






















Our model requires the task graph to be provided (through profiling or a complete execution of the
application). Thus, we consider deterministic applications: for the same parameters and the same input data,
the same task graph is generated. In order to guarantee that edges are the same over all possible executions,
one has to make sure that the communications between the processes are the same. Non-deterministic
communications in MPI are either receptions from an unknown source (by using MPI_ANY_SOURCE in the
reception call), or non-deterministic completion events (MPI_Waitany for instance). Any application with
such events is considered non-deterministic and thus falls out of the scope of the proposed solution.





Figure 3: Slack time 

Tasks within a core are totally ordered. If a task $T_i$ ends with a send event, then the following task
starts exactly at the end of $T_i$. In Figure 1, task $T_2$ starts exactly after $T_1$ ends. On the other hand, when
a task is created by a message reception ($T_4$ in Figure 1), it cannot start before all the tasks it depends on
finish ($T_1$ and $T_3$) and it has to wait for the message to be received. If the message arrives after the end of
the task which is supposed to receive it, the time between the end of the task and the reception is known as
slack time. In Figure 3, task $T_1$ sends a message to $T_3$ but $T_3$ ends before receiving the message, creating
the slack represented by dotted lines.

The energy consumption $E_i$ of a task is defined as the product of its execution time $exec_i$ and its power
consumption $P_i$. Since the application is composed of several tasks, its energy consumption can be expressed
as the sum of the energy consumption of all the tasks. Thus, the goal translates into providing the set of
frequencies at which to execute each task. Hence, one can calculate the application energy consumption as:

$$E = \sum_i E_i = \sum_i (exec_i \times P_i) \qquad (1)$$

Minimizing the energy consumption of the application is equivalent to minimizing $E$ in equation (1).

For each task $T_i$, both $exec_i$ and $P_i$ depend on the frequency of the different phases of the task. In addition,
tasks are not independent: when executed in parallel on the same processor, the tasks share the same
frequency. Moreover, the overall execution time of the application depends on all the $exec_i$ and the slack
time. To minimize the energy consumption while still controlling the overall execution time, we express the
problem as a linear program.
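
As an illustration of equation (1), the following minimal Python sketch sums the per-task energies. The `Task` record and the numeric values are hypothetical and only serve the example; they are not taken from any measurement.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    exec_time: float  # exec_i, in seconds (hypothetical value)
    power: float      # P_i, in watts (hypothetical value)


def application_energy(tasks):
    """Equation (1): E = sum_i (exec_i * P_i), in joules."""
    return sum(t.exec_time * t.power for t in tasks)


tasks = [Task("T1", 2.0, 50.0), Task("T2", 1.5, 40.0)]
print(application_energy(tasks))
```

In the actual problem neither `exec_time` nor `power` is a constant: both depend on the frequencies chosen for the task, which is what the linear program decides.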


3 Building the linear program 

The following paragraphs describe how the energy minimization problem translates into a linear program.
We first describe the precedence constraints between the tasks, then we describe two formulations
which consider the architecture constraints. Finally, we discuss the feasibility of the described solutions.

3.1 Precedence constraints 

Let $T_i$ be a task defined by its start time $bT_i$ and its end time $eT_i$. The beginning of a task is bounded by the
precedence relations between tasks. As already stressed, a task cannot start before its direct predecessors
complete their execution. As explained in section 2, if $T_i$ sends a message, its child task $T_j$ starts exactly
when $T_i$ ends since the end of the communication means the beginning of the next task. This translates to:

$$bT_j = eT_i$$






Symbol          Meaning
$bT_i$          Beginning of a task $T_i$
$eT_i$          End of a task $T_i$
$bTs_i$         Beginning of a slack task $Ts_i$
$eTs_i$         End of a slack task $Ts_i$
$exec_i^f$      The execution time of a task $T_i$ if executed completely at frequency $f$
$tT_i^f$        The time during which the task $T_i$ is executed at frequency $f$
$\delta_i^f$    The fraction of time a task $T_i$ spends at frequency $f$
$M_i^j$         Message transmission time from task $T_j$ to task $T_i$

Table 1: Task variables


On the other hand, when $T_i$ ends with a message reception from $T_k$, one has to make sure that its
successor task $T_j$ starts after both tasks end. Moreover, as pointed out in section 2, when a task receives a
message, some slack may be introduced before the reception. Slack is handled the same way tasks are: it
has a start and an end time and it can be executed at different frequencies depending on the tasks on the
other cores. In Figure 3, the slack after $T_3$ may be executed at different frequencies depending on whether it is executed
in parallel with $T_1$ or $T_2$.

To ease the presentation, we assume that each task $T_i$ receiving a message (from a task $T_k$) is followed
by a slack task, denoted $Ts_i$. The beginning of $Ts_i$, denoted $bTs_i$, is exactly equal to the end of $T_i$,

$$bTs_i = eT_i \qquad (2)$$

whereas its end time, denoted $eTs_i$, is at least equal to the arrival time of the message from $T_k$. Let $M_i^k$
denote the transmission time from $T_k$ to $T_i$. Thus:

$$eTs_i \ge eT_k + M_i^k \qquad (3)$$

Note that a task may receive messages from different processes (after a collective communication for
example) and equation (3) has to be valid for all of them.

Finally, since $T_j$, the successor task of $T_i$, has to start after $T_i$ and $T_k$ finish, one just needs to make sure
that:

$$bT_j = eTs_i$$
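
The precedence rules for a receiving task can be sketched as follows. This is an illustrative Python helper, not part of the formulation itself: it picks the earliest end time allowed by equation (3), whereas the linear program only states the inequality. The function name and the timings are hypothetical.

```python
def slack_end(e_task, incoming):
    """Earliest end time eTs_i of the slack task that follows T_i.

    e_task   -- eT_i, end time of the receiving task (also bTs_i, equation (2))
    incoming -- list of (eT_k, M) pairs, one per sender T_k
    """
    # Equation (3) must hold for every sender: eTs_i >= eT_k + M.
    arrivals = [e_k + m for e_k, m in incoming]
    # Assumption: the slack cannot end before it begins (eTs_i >= bTs_i).
    return max([e_task] + arrivals)


# Hypothetical timings: T_3 ends at t=4.0; T_1 ends at t=5.0, message takes 0.5.
e_ts3 = slack_end(4.0, [(5.0, 0.5)])
slack = e_ts3 - 4.0   # length of the slack task
b_next = e_ts3        # bT_j = eTs_i for the successor task
print(e_ts3, slack, b_next)
```

When the message arrives before the receiving task ends, the helper returns `e_task` and the slack has zero length.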

In order to compute the end time $eT_i$ of a task $T_i$, one has to evaluate the execution time of $T_i$. As
explained above, a task may be executed at different frequencies. Let $exec_i^f$ be the execution time of $T_i$ if
executed completely at frequency $f$. Every frequency can be used to run a fraction $\delta_i^f$ of the total execution of
the task. Let $tT_i^f$ be the time $T_i$ spends at frequency $f$. It can be expressed as $tT_i^f = \delta_i^f \times exec_i^f$.
Thus, the end time of a task is:

$$eT_i = bT_i + \sum_f tT_i^f$$

Note that one has to make sure that a task is completely executed:

$$\sum_f \delta_i^f = 1 \qquad (4)$$

Finally, since the power consumption depends on the frequency, let $P_i^f$ be the power consumption of the
task $T_i$ when executed at frequency $f$. Using this formulation, the objective function of the linear program
becomes:

$$\min\Big(\sum_{T_i} \sum_f (tT_i^f \times P_i^f)\Big) \qquad (5)$$
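
To make the quantities above concrete, here is a small Python sketch that evaluates $tT_i^f$, the end time and the energy term of one task for a given choice of fractions. It only evaluates a candidate solution; it does not solve the linear program. All frequencies, times and powers are made-up values.

```python
def task_times(deltas, exec_at):
    """tT_i^f = delta_i^f * exec_i^f for every frequency f."""
    # Equation (4): the fractions must cover the whole task.
    if abs(sum(deltas.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1 (equation (4))")
    return {f: deltas[f] * exec_at[f] for f in deltas}


def task_energy(times, power_at):
    """Contribution of one task to objective (5): sum_f tT_i^f * P_i^f."""
    return sum(t * power_at[f] for f, t in times.items())


# Hypothetical task: 10 s at 2.0 GHz or 18 s at 1.0 GHz, half of the work at each.
exec_at = {2.0: 10.0, 1.0: 18.0}    # exec_i^f in seconds
power_at = {2.0: 60.0, 1.0: 25.0}   # P_i^f in watts
times = task_times({2.0: 0.5, 1.0: 0.5}, exec_at)
end_time = 0.0 + sum(times.values())  # eT_i = bT_i + sum_f tT_i^f, with bT_i = 0
print(times, end_time, task_energy(times, power_at))
```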




One can just use $tT_i^f$ in the objective function as it is expressed in equation (5), and the solver would
provide the values of $tT_i^f$ for all tasks at all frequencies. This solution was presented in [?]. It
can be used on architectures other than the ones we target in this work. As a matter of fact,
nothing constrains parallel tasks on one processor to run at the same frequency, and the frequency switching
threshold is not considered either. Moreover, no constraint on the execution time is expressed. The following
paragraphs first describe how the performance is handled, then introduce additional constraints that
handle the architecture constraints and execution time.


3.2 Execution time constraints 

The performance of an application is a major concern, whether the energy consumption is considered or not.
In this paragraph we provide constraints which consider the execution time of the application. In MPI, all
programs end with MPI_Finalize, which is similar to a global barrier. Let $last\_task_i$ be the last task on core
$i$ (the MPI_Finalize task). Since the application ends with a global communication, every task $last\_task_i$ is
followed by a slack task $last\_slack\_task_i$. The difference between the global communication slack and the
other slack tasks lies in the end time: the end time of all slack tasks of a global communication is the same
(all processes leave the barrier at the same time). Thus, for every pair of cores $(i, j)$:

$$e\,last\_slack\_task_i = e\,last\_slack\_task_j \qquad (6)$$

Let $total\_time$ be the application execution time. It is equal to the end time of the last slack task:

$$total\_time = e\,last\_slack\_task_i \qquad (7)$$


However, in some cases, increasing the execution time of an application could benefit energy consumption.
In order to allow this performance loss to a specified extent, the user limits the degradation to a factor
$x$ of the maximal performance. Let $exec\_time$ be the execution time when all tasks run at the maximal
frequency, and $x$ the maximum performance loss percentage allowed by the user. The following constraint
allows performance loss with respect to $x$:

$$total\_time \le exec\_time + \frac{exec\_time \times x}{100}$$
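
As a quick numeric illustration of this bound, the following Python sketch computes the largest admissible `total_time`. The function name and the numbers are hypothetical.

```python
def max_total_time(exec_time, x_percent):
    """Upper bound on total_time: exec_time + exec_time * x / 100."""
    return exec_time + exec_time * x_percent / 100.0


# A run that takes 200 s at maximal frequency, with a 5% tolerated slowdown.
print(max_total_time(200.0, 5.0))
```

With $x = 0$ the bound reduces to `exec_time`, i.e. no slowdown is tolerated.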


The next sections describe two different formulations. In the first formulation, the solver is provided 
with all possible task configurations and chooses the one minimizing energy consumption. In the second 
formulation, the solver provides the exact time of every frequency switch on each processor. 


3.3 Architecture constraints: the workload approach 

In order to provide the optimal frequency schedule, the linear program is provided with all possible task
configurations, i.e., all possible sets of parallel tasks, known as workloads. Then the solver provides the execution
frequency of each workload.

3.3.1 Shared frequency constraint 

We need to express that tasks executed at the same time on the same processor run at the same frequency.
Hence, we first need to identify tasks executed in parallel on the same processor. Depending on the frequency
being used, the set of parallel tasks may change. Figure 4 is an example of two different executions
running at the maximal and minimal frequency. Only processes that belong to the same processor are
represented. In Figure 4a, when the processor runs at $f_{max}$, the set of pairs of parallel tasks
is $\{(T_1, T_3), (T_1, Ts_3), (Ts_1, Ts_3), (T_2, T_4)\}$ (represented by red dotted lines). When the frequency is
set to $f_{min}$ (Figure 4b), the slack after $T_3$ is completely covered and the set of parallel tasks becomes
$\{(T_1, T_3), (Ts_1, T_3), (T_2, T_4)\}$.



Symbol            Meaning
$bW_i$            Beginning of a workload $W_i$
$eW_i$            End of a workload $W_i$
$tW_i^f$          The time a workload $W_i$ is executed at frequency $f$
$dW_i$            The duration of a workload
$\delta W_i^f$    A binary variable used to say if a workload is executed at a frequency $f$ or not

Table 2: Workload formulation variables


In order to provide all possible configurations, we define the processor workloads. A workload, denoted
$W_i$, is a tuple of potentially parallel tasks. In Figure 4, $W_1 = (T_1, T_3)$, $W_2 = (Ts_1, T_3)$ and $W_3 = (T_1, Ts_3)$
represent a subset of the possible workloads. Note that no two workloads have the same set of tasks.
In other words, once a task in a workload is over, a new workload begins. On the other hand, a task can
belong to several workloads (like $T_1$ in Figure 4a).




(a) $f_{max}$    (b) $f_{min}$

Figure 4: Workloads

Recall that our goal is to calculate the time a task should spend at each frequency ($tT_i^f$) in
order to minimize the energy consumption of the application according to the objective function (5). Since
tasks may be executed at several frequencies, so may a workload. In Figure 5, the workload $W_1 = (T_1, T_3)$
is executed at frequency $f_1$ then at frequency $f_2$. Thus, since $T_1$ belongs to both $W_1 = (T_1, T_3)$ and
$W_2 = (T_1, Ts_3)$, the execution time of $T_1$ at frequency $f_1$ can be calculated by using the
time $W_1$ and $W_2$ spend at frequency $f_1$. In other words, the execution time of a task can be calculated
according to the execution time of the workloads it belongs to. Let $tW_i^f$ be the time the workload
$W_i$ spends at frequency $f$. Thus:

$$tT_i^f = \sum_{W_j,\, T_i \in W_j} tW_j^f \qquad (8)$$
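
Equation (8) can be read as a simple aggregation over the workloads that contain the task. The Python sketch below illustrates it with two workloads in the spirit of the example above; the workload names, the dictionary layout and the durations are assumptions made for the illustration.

```python
def task_time_at(task, freq, workloads, tw):
    """Equation (8): tT_i^f = sum of tW_j^f over the workloads W_j containing T_i.

    workloads -- maps a workload name to the tuple of its tasks
    tw        -- maps (workload name, frequency) to the time spent, tW_j^f
    """
    return sum(
        tw.get((name, freq), 0.0)
        for name, members in workloads.items()
        if task in members
    )


workloads = {"W1": ("T1", "T3"), "W2": ("T1", "Ts3")}
tw = {("W1", "f1"): 2.0, ("W1", "f2"): 1.0, ("W2", "f1"): 0.5}  # hypothetical
print(task_time_at("T1", "f1", workloads, tw))
print(task_time_at("T3", "f1", workloads, tw))
```

$T_1$ accumulates time from both workloads, whereas $T_3$ only belongs to the first one.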

Using the execution time of a workload at a specific frequency ($tW_i^f$), one can calculate the duration of a
workload, $dW_i$, as:

$$dW_i = \sum_f tW_i^f$$

Figure 5: Workloads and tasks execution


3.3.2 Handling frequency switch delay


Recall that one of the problems when considering DVFS is the time required to actually set a new frequency.
Thus, before setting a frequency, one has to make sure that the duration of the workload is long enough to
tolerate the frequency change. In other words, if the frequency $f$
is set in a workload $W_i$, $tW_i^f$ has to be larger than a user-defined threshold, denoted $Th$:

$$tW_i^f \ge Th \times \delta W_i^f \qquad (9)$$

$\delta W_i^f$ is a binary variable used to guarantee that definition (9) remains true when $tW_i^f = 0$:

$$\delta W_i^f = \begin{cases} 0 & tW_i^f = 0 \\ 1 & \text{otherwise} \end{cases} \qquad (10)$$

The expression of definition (10) as a mixed binary programming formulation is given in the appendix.
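
The interplay between the threshold constraint (9) and the binary variable of definition (10) can be checked by brute force on a single workload and frequency. The Python sketch below is only an illustration: it pairs (9) with the big-M link "time <= M x binary" described in the appendix, and the value of `big_m` is an arbitrary assumption.

```python
def threshold_ok(tw, delta, th, big_m):
    """True if (tw, delta) satisfies both linear constraints.

    tw >= th * delta     -- constraint (9): a used frequency lasts at least Th
    tw <= big_m * delta  -- big-M link: delta = 0 forces tw = 0 (definition (10))
    """
    return tw >= th * delta and tw <= big_m * delta


def feasible_deltas(tw, th, big_m=1e6):
    """Values of the binary variable compatible with a given duration tw."""
    return [d for d in (0, 1) if threshold_ok(tw, d, th, big_m)]


print(feasible_deltas(0.0, 0.1))   # frequency not used
print(feasible_deltas(0.5, 0.1))   # frequency used long enough
print(feasible_deltas(0.05, 0.1))  # shorter than the threshold
```

A duration that is non-zero but shorter than $Th$ leaves no feasible value for the binary variable, which is how such short frequency settings are excluded.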


3.3.3 Valid workload filtering 

The linear program is provided with all possible workloads, then it provides the different $tW_j^f$ for each
workload. However, not all workloads can be present in one execution. In Figure 4, $W_1 = (T_1, Ts_3)$ and
$W_2 = (Ts_1, T_3)$ are both possible workloads, but they cannot be in the same execution: if $W_1$ is
being executed, it means that $T_3$ is over (since $Ts_3$ is after $T_3$), thus $W_2$ cannot appear later since $Ts_1$
and $T_3$ are never parallel. Thus, in order to prevent $W_1$ and $W_2$ from both existing in one execution, we
just need to check whether the tasks of the workload can be parallel or not. Two tasks are not parallel if
one ends before the beginning of the second. Since we consider workloads, we focus only on the beginning
and end time of the workload itself. Let $bW_j$ and $eW_j$ be the start time and the end time of the workload
$W_j = (T_1, \ldots, T_i, \ldots, T_n)$. They are such that, for every task $T_i$ of $W_j$:

$$bW_j \ge bT_i \qquad (11)$$

$$eW_j \le eT_i \qquad (12)$$

Note that although the beginning and the end of the workload are not exactly defined, this definition makes
sure that the beginning or the end of a task starts a new workload. Moreover, the complete execution of a
task is guaranteed thanks to equations (4) and (8).

Figure 6 is an example of a workload that cannot exist. Let us assume the execution represented in
Figure 6 and let us focus on the workload $W_1 = (T_1, Ts_3)$. Let us also assume that with other frequencies, a
possible workload is $W_2 = (T_3, Ts_1)$. As explained above, $W_1$ and $W_2$ cannot both exist in the same execution
because of precedence constraints. It is obvious from the example that $T_3$ and $Ts_1$ are not parallel; let us
see how it translates to workloads. Since $W_2$ has to start after both $T_3$ and $Ts_1$ begin, it starts after
$Ts_1$ begins (since $bTs_1 > bT_3$, Figure 6). In the same way, it ends before $eT_3$. But since $eT_3 < bTs_1$ (as shown in
Figure 6), the duration of $W_2$ would be negative, which is not possible.

Thus, we identify workloads which cannot be in the execution as workloads which end before they begin.
The duration of a workload is such that:

$$dW_i = \begin{cases} 0 & eW_i < bW_i \\ eW_i - bW_i & \text{otherwise} \end{cases} \qquad (13)$$


In the appendix (section 6), we prove that if two workloads cannot be in the same execution (because
of the precedence constraints), then the duration of at least one of them is 0 (paragraph 6.4.2).
Figure 6: Negative workload duration for impossible workloads
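
The filtering rule can be sketched in a few lines of Python. The sketch assumes the tightest bounds allowed by (11) and (12), i.e. the workload starts when its last task starts and ends when its first task ends; the linear program itself only states the inequalities. Task timings are hypothetical.

```python
def workload_duration(tasks):
    """Definition (13) for a workload given as a list of (bT, eT) pairs."""
    b_w = max(b for b, _ in tasks)  # tightest start satisfying (11)
    e_w = min(e for _, e in tasks)  # tightest end satisfying (12)
    # A workload that would end before it begins gets a zero duration.
    return max(0.0, e_w - b_w)


t1 = (0.0, 4.0)    # hypothetical (bT_1, eT_1)
t3 = (0.0, 3.0)    # hypothetical (bT_3, eT_3)
ts1 = (4.0, 6.0)   # slack after T_1, starting once T_1 is over

print(workload_duration([t1, t3]))   # the two tasks overlap
print(workload_duration([t3, ts1]))  # T_3 is over before Ts_1 begins
```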


3.3.4 Discussion 

The appendix (section 6) provides a detailed formulation of the energy minimization problem using workloads.
The formulation shows the use of two binary variables: one to express the threshold constraint and one
to calculate the duration of the workload. With these two variables, the formulation is no longer purely linear,
which requires more time to solve (especially when the number of workloads is large).

Moreover, we tried providing all possible workloads of one of the NAS parallel benchmarks on class C on
16 processes (IS.C.16) on a machine equipped with 16 GB of memory. The application task graph is composed
of 630 tasks. The generated data (i.e. the set of workloads) could not fit in the memory of the machine.
Thus, even with no binary variables, providing all possible workloads is not possible when considering real
applications.

In the following section, we provide another formulation which requires only the task graph. 

3.4 Architecture constraints: the frequency switch approach 

As explained earlier, our goal is to minimize the energy consumption of a parallel application using DVFS.
In order to do so, we express the problem as a linear program. We consider that the program is represented
as a task graph and each task can have several phases. The difficulty of the formulation is to provide, for
each task, the frequency of each of its phases ($tT_i^f$) while making sure that parallel tasks run
at the same frequency. In this section, we provide another formulation which considers the time to set a new
frequency on the whole processor instead of considering tasks independently and then forcing parallel tasks to
run at the same frequency.

3.4.1 Frequency switch overhead 

Let $c_{jp}^f$ be the time the frequency $f$ is set on the processor $p$, $j$ being the sequence number of the frequency
switch. Figure 7 represents the execution of four tasks on two cores of the same processor $p$. In the
example, we assume that there are only 3 possible frequencies. The different switches are numbered such that the
minimum frequency $f_1$ corresponds to the switching times $c_{1p}^{f_1}, c_{4p}^{f_1}, \ldots$, the frequency $f_2$ corresponds to the
frequency changes $c_{2p}^{f_2}, c_{5p}^{f_2}, \ldots$ and so on. A frequency $f$ set at switch $j$ is applied during a time which can be calculated
as $c_{\{j+1\}p}^{f'} - c_{jp}^{f}$, where $f'$ is the frequency set at the next switch. This can be translated to:

$$c_{\{j+1\}p}^{f'} - c_{jp}^{f} \ge 0$$


Symbol        Meaning
$c_{jp}^f$    Time of the frequency switch $j$ on processor $p$. The frequency $f$ is the one set
$d_{ij}^f$    The amount of time a frequency $f$ is set for the task $i$ for the frequency switch $j$

Table 3: Frequency switch formulation variables

Figure 7: Frequency switches example


Note that some frequencies may not be set if the duration is zero. In Figure 7, frequency $f_3$ is not set
since $c_{3p}^{f_3} = c_{4p}^{f_1}$.

3.4.2 Handling frequency switch delay 

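As explained earlier, changing frequency takes some time. Thus, for a change to be applied, its duration has
to be longer than the user-defined threshold $Th$. Let $\delta c_{jp}^f$ be a binary variable, such that:

$$\delta c_{jp}^f = \begin{cases} 0 & c_{\{j+1\}p}^{f'} - c_{jp}^f = 0 \\ 1 & \text{otherwise} \end{cases}$$

where $f'$ denotes the frequency set at switch $j+1$. The threshold condition can be expressed as:

$$c_{\{j+1\}p}^{f'} - c_{jp}^f \ge Th \times \delta c_{jp}^f \qquad (14)$$

We detail how equation (14) is translated into mixed binary programming constraints in the appendix.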


3.4.3 Shared frequency constraints 

Once the threshold condition is satisfied, one can calculate the time a task spends at each frequency, i.e.
$tT_i^f$, according to the switch times $c_{jp}^f$. In Figure 7, initially, tasks $T_1$ and $T_3$ run in parallel at frequency $f_1$. The time $T_3$
spends at frequency $f_1$ is $c_{21}^{f_2} - c_{11}^{f_1}$, whereas $T_1$ is executed twice at $f_1$: it spends $(c_{21}^{f_2} - c_{11}^{f_1}) + (eT_1 - c_{41}^{f_1})$ at
frequency $f_1$. Let $d_{ij}^f$ be the time the task $T_i$ spends at frequency $f$ after the frequency switch $j$. Back to
Figure 7, $d_{11}^{f_1} = c_{21}^{f_2} - c_{11}^{f_1}$ and $d_{14}^{f_1} = eT_1 - c_{41}^{f_1}$, and $tT_1^{f_1}$ becomes $tT_1^{f_1} = d_{11}^{f_1} + d_{14}^{f_1}$.

The above translates to:

$$tT_i^f = \sum_j d_{ij}^f$$

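Note that a task is not impacted by a frequency change if it ends before the change or begins after the
next change. In other words, $d_{ij}^f = 0$ if $eT_i < c_{jp}^f$ or $bT_i > c_{\{j+1\}p}^{f'}$. Otherwise, $d_{ij}^f$ can be calculated as
$\min(eT_i, c_{\{j+1\}p}^{f'}) - \max(bT_i, c_{jp}^f)$:

$$d_{ij}^f = \begin{cases} 0 & eT_i < c_{jp}^f \text{ or } bT_i > c_{\{j+1\}p}^{f'} \\ \min(eT_i, c_{\{j+1\}p}^{f'}) - \max(bT_i, c_{jp}^f) & \text{otherwise} \end{cases} \qquad (15)$$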
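
Definition (15) is an interval overlap, which the following Python sketch evaluates for one task against a sequence of frequency switches on its processor. The switch times and the task bounds are made-up values loosely inspired by the description of Figure 7 (three frequencies, with $f_3$ skipped because two consecutive switch times coincide); the final entry of `switch_times` is a hypothetical sentinel closing the last interval.

```python
def time_after_switch(b_t, e_t, c_j, c_next):
    """Definition (15): time a task [b_t, e_t] spends between switches j and j+1."""
    if e_t < c_j or b_t > c_next:
        return 0.0  # the task does not overlap this frequency interval
    return min(e_t, c_next) - max(b_t, c_j)


def task_time_per_frequency(b_t, e_t, switch_times, freqs):
    """tT_i^f = sum_j d_ij^f, accumulated per frequency.

    switch_times[j] is the time of switch j; freqs[j] is the frequency it sets.
    switch_times holds one extra final entry closing the last interval.
    """
    totals = {}
    for j, f in enumerate(freqs):
        d = time_after_switch(b_t, e_t, switch_times[j], switch_times[j + 1])
        totals[f] = totals.get(f, 0.0) + d
    return totals


switch_times = [0.0, 3.0, 5.0, 5.0, 9.0]  # f_3 lasts zero time (5.0 to 5.0)
freqs = ["f1", "f2", "f3", "f1"]
per_freq = task_time_per_frequency(0.0, 7.0, switch_times, freqs)
print(per_freq)
```

The per-frequency times add up to the task length, and the task is counted twice at `f1`, once for each interval during which that frequency is set.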


3.5 Discussion 

The appendix (section 6) provides the complete formulation of the problem using the frequency switch time
variables. In addition to the binary variable used to satisfy the frequency switch overhead, for each task and
for each frequency switch, five additional binary variables are used. Thus, for $n$ tasks and $m$ frequency
switches considered, $5 \times n \times m$ binary variables are required. Mixed integer programming is NP-hard [?]; thus,
with such a number of binary variables, the solver cannot provide a solution in practice.

When comparing the workload approach and the frequency switch approach, one can notice that the
former needs fewer binary variables and should be able to provide results. However, because all possible
workloads have to be provided to the solver, it is as complex because of the memory required. Thus, if a
very large memory is available, then the workload solution is the one to be used. And if new, faster binary
resolution techniques are provided, then the frequency switch solution should be used.

Several heuristics can be considered in order to reduce the time to solve the problem. First, one can
consider iterative applications, solve the problem for only one iteration, then apply the solution to the remaining
ones. However, this solution strongly depends on the number of tasks per iteration. We tried this solution
on some kernels (NAS Parallel Benchmarks [?]) and the solver could not provide any result after several
hours.

The most promising heuristic is to consider the tasks at the processor level instead of the core level.
Thus, the only architecture constraint which needs to be considered is the frequency switch overhead. This
study is part of our current work and will be discussed in further studies.


4 Related Work 

DVFS scheduling has been widely used to improve processor energy consumption during application execution.
We focus on studies assuming a set of dependent tasks represented as a directed acyclic graph (DAG).

Many studies tackle the task mapping problem while minimizing energy consumption either with respect to
task deadlines [?] or by trying to minimize the deadline as well [?]. When considering an already mapped task
graph, studies provide the execution speed of each task depending on the frequency model: continuous [?]
or discrete [?]. Some studies also provide a set of frequencies to execute a task [?] (executing a task
at multiple frequencies is known as VDD-Hopping). In [?], the authors present a complexity study of the
energy minimization problem depending on the frequency model (continuous frequencies, discrete frequencies
with and without VDD-Hopping). Finally, studies like [?] and [?] consider frequency transition overhead.
Although these studies should provide an optimal frequency schedule, they do not consider the constraints of
most current architectures and more specifically the shared frequency among all cores of the same processor.

Many linear programming formulations have been proposed in the past to minimize application energy
consumption. For a single processor, [?] provides an integer linear
programming formulation with negligible frequency switching overhead. The same problem, but considering
frequency transition overhead, was addressed in [?]. The authors also provide a linear-time heuristic algorithm
which provides a near-optimal solution.

The work presented in [?] is the closest to the work presented in this paper. In [?], the authors present a
linear programming formulation of the energy minimization problem where tasks can be executed at several
frequencies. Both slack energy and processor energy consumption are considered in the minimization and a
loose deadline is considered. In a similar way, [?] provides a scheduling algorithm and an integer linear
programming formulation of the energy minimization problem on heterogeneous systems with a fixed deadline.
The formulation is very close to the one described in [?], but the authors also consider communication
energy consumption. However, they do not consider slack time and its power consumption when solving the
problem. In [?] the authors use an integer linear programming formulation of the problem where only tasks
with slack time are slowed down, whereas other tasks are run at maximal frequency. The program is used
to compute the best execution frequency of a task.

Although previous studies provide different solutions and formulations for DVFS scheduling, few of
them consider current architecture constraints. While some previous studies consider frequency transition
overhead [?, ?], none of them considers the fact that cores within the same processor run at the same frequency.
This paper describes a mixed integer linear programming formulation that guarantees that parallel tasks on the same
processor run at the same frequency. Moreover, it shows that it is possible to relax the deadline if it leads
to energy savings.



5 Conclusion 


The goal of this paper was to study how the energy minimization problem of a parallel execution of an
MPI-like program can be addressed and formulated when considering most current architecture constraints.
In order to do so, we used a linear programming formulation. Two different formulations were described. Their
goal is to minimize the energy consumption with respect to a user-defined deadline by providing the optimal
frequency schedule. Both solutions use a number of binary variables which is proportional to the number
of tasks. Used as they are, these formulations should provide an optimal solution but are costly in terms of
memory and resolution time, despite the use of fast parallel solvers like Gurobi [?].

We are currently working on introducing heuristics to relax the architecture constraints by building tasks
at the processor level instead of the core level. Using such heuristics seems to drastically reduce the time
needed to solve the problem.

6 Appendix 

This appendix summarizes the set of constraints of both formulations described in sections 3.3 and 3.4.
We start by describing how each non-linear constraint which appears in sections 3.3 and 3.4 is expressed.
For a more complete description and explanation, the reader can refer to [?].

6.1 Expressing non linear constraints 

Section 3 presents different non-continuous variable definitions ((10), (13), (14) and (15)). In this section, we
briefly explain how this kind of expression translates to inequalities using binary variables.

1. If-then statement with 0-1 variables: Expressing conditions like:

$$X = \begin{cases} 0 & x = 0 \\ 1 & \text{otherwise} \end{cases}$$

(for instance, definition (10)) requires the use of a large constant $M$ such that:

$$x \le M \times X \qquad (16)$$

$$x \ge X \times \epsilon \qquad (17)$$

Thus, when $x = 0$, (17) forces $X$ to be equal to 0 and when $x \ne 0$, (16) is used to set the value of $X$ to 1.

Note that equation (9), which guarantees that $tW_i^f \ge Th \times \delta W_i^f$, makes (17) useless (since $Th > \epsilon$).
Thus, (17) is never used in the set of constraints.

2. If-then statement with real variables: Expressing formulas like:

$$z = \begin{cases} 0 & y < x \\ y - x & \text{otherwise} \end{cases}$$

(definition (13) for instance) is similar to the previous formulation in the sense that it requires the use
of a big constant $M$. A binary variable $bin$ is used such that when $y - x < 0$, $bin = 0$.

$$y - x \le M \times bin \qquad (18)$$

$$x - y \le M \times (1 - bin) \qquad (19)$$

Thus, when $y < x$, (18) is always valid regardless of the value of $bin$. Hence, (19) forces $bin$ to be equal
to 0. Similarly, when $y > x$, equation (18) forces $bin$ to 1.

Once $bin$ is defined, $z$ can be expressed as:

$$y - x \le z \le M \times bin \qquad (20)$$

$$y - x + z \le 2 \times (y - x) + M \times (1 - bin) \qquad (21)$$

Thus, when $y < x$, $bin = 0$ (from (19)) and (20) forces $z$ to be 0 (since all variables are positive) and
(21) is always valid. Similarly, when $y > x$, $bin = 1$ (from (18)) and (20) and (21) become:

$$y - x \le z \le M$$

$$z \le y - x$$

Thus $y - x \le z \le y - x$, which makes $z = y - x$.

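3. Maximums: Maximums can be expressed by reformulating the definition as: 

$z = \max(x, y) = x + \begin{cases} 0 & \text{if } x > y \\ y - x & \text{otherwise} \end{cases}$

Let $w$ be such that: 

$w = \begin{cases} 0 & \text{if } x > y \\ y - x & \text{otherwise} \end{cases}$

We can express $w$ by using (20) and (21). 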


4. Minimums: Expressing minimums is based on the same idea as expressing maximums: 

$z = \min(x, y) = x - \begin{cases} 0 & \text{if } x < y \\ x - y & \text{otherwise} \end{cases}$

We do not detail how minimums are expressed, since it is done in the same way as for maximums. 


5. Expressing several conditions: In definitions like $d_{ij}^f$, several conditions can force the value of a 
variable: 

$d = \begin{cases} 0 & \text{if } x < y \text{ or } z > u \\ \text{its regular value} & \text{otherwise} \end{cases}$

Translating such definitions into inequalities requires the use of one binary variable for each condition 
and one binary variable to express the "or". 

Let $bin_1$, $bin_2$ be such that: 

$bin_1 = \begin{cases} 1 & \text{if } z - u > 0 \\ 0 & \text{otherwise} \end{cases}$  and  $bin_2 = \begin{cases} 1 & \text{if } x - y < 0 \\ 0 & \text{otherwise} \end{cases}$

These two definitions can be expressed using (18) and (19). 

Finally, $bin_3$ is a binary variable which is equal to 1 if $bin_1$ or $bin_2$ is equal to 1 and 0 otherwise: 

$bin_3 = \begin{cases} 1 & \text{if } bin_1 + bin_2 \ge 1 \\ 0 & \text{otherwise} \end{cases}$    (22)

Since $bin_1$, $bin_2$ and $bin_3$ are binary variables, (22) can be easily expressed as: 

$bin_1 \le bin_3$    (23)

$bin_2 \le bin_3$    (24)

$bin_3 \le bin_1 + bin_2$    (25)

Thus, when $bin_1$ and $bin_2$ are 0, (25) forces $bin_3$ to be 0 whereas when $bin_1$ or $bin_2$ is equal to 1, 
(23) and (24) force $bin_3$ to be equal to 1. 
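
The translations above can be checked numerically. The following minimal sketch enumerates small non-negative integer values (standing in for the real variables, and assuming, as the text does, that all variables are positive) and verifies that inequalities (18) to (21) leave a single feasible value for z, and that (23) to (25) encode the logical "or". The names `feasible_z` and `feasible_or`, the grid and the value of `M` are illustrative choices, not part of the formulation.

```python
from itertools import product

M = 1000  # large constant; assumed to exceed every |y - x| tried below


def feasible_z(x, y, candidates):
    """Return the (bin, z) pairs that satisfy inequalities (18)-(21)."""
    solutions = set()
    for b, z in product((0, 1), candidates):
        if (y - x <= M * b                                    # (18)
                and x - y <= M * (1 - b)                      # (19)
                and y - x <= z <= M * b                       # (20)
                and y - x + z <= 2 * (y - x) + M * (1 - b)):  # (21)
            solutions.add((b, z))
    return solutions


def feasible_or(bin1, bin2):
    """Return the values of bin3 allowed by inequalities (23)-(25)."""
    return [b3 for b3 in (0, 1)
            if bin1 <= b3 and bin2 <= b3 and b3 <= bin1 + bin2]


grid = range(0, 11)  # non-negative values only
for x, y in product(grid, grid):
    z_values = {z for _, z in feasible_z(x, y, grid)}
    # z must be 0 when y <= x and y - x otherwise
    assert z_values == {max(0, y - x)}, (x, y, z_values)

for b1, b2 in product((0, 1), repeat=2):
    assert feasible_or(b1, b2) == [b1 | b2]
```

When y = x, both values of bin remain feasible, but z is still forced to 0, which is why the check compares only the z component.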


6.2 Objective function 

Minimizing the energy consumption of a program described as a set of tasks is the objective function of the 
linear programming formulations described above. For a task $T_i$ with a power consumption $P_i^f$ at 
frequency $f$ and executed at frequency $f$ during $tT_i^f$, the objective, which sums the energy consumption of the whole program over its whole 
execution time, is: 

$\min \Big( \sum_{T_i} \sum_{f} \big( tT_i^f \times P_i^f \big) \Big)$
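
As an illustration of the quantity being minimized, the sketch below evaluates the double sum for a fixed schedule. The task names, frequencies (in MHz), durations and power values are hypothetical and only serve to show the computation; a solver would choose the durations.

```python
def program_energy(task_times, task_power):
    """Sum tT_i^f * P_i^f over all tasks i and frequencies f."""
    return sum(
        duration * task_power[task][freq]
        for task, per_freq in task_times.items()
        for freq, duration in per_freq.items()
    )


# Hypothetical schedule: seconds spent by each task at each frequency (MHz)
times = {"T1": {2000: 1.5, 1200: 0.0}, "T2": {2000: 0.5, 1200: 2.0}}
# Hypothetical power draw (watts) of each task at each frequency
power = {"T1": {2000: 30.0, 1200: 18.0}, "T2": {2000: 28.0, 1200: 16.0}}

energy = program_energy(times, power)  # 1.5*30 + 0.5*28 + 2.0*16 joules
```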


6.3 Task constraints 

Let $T_i$, $T_{i+1}$, $T_{i+2}$, $T_j$ be four tasks such that $T_i$, $T_{i+1}$, $T_{i+2}$ are consecutive and on the same processor. $T_i$ 
ends with a message sending, creating $T_{i+1}$, which ends with a reception from $T_j$, which generates $T_{i+2}$, as 
shown in Figure 8. 

Figure 8: Task configuration 


$eT_i = bT_i + \sum_f tT_i^f$

$\sum_f \delta_i^f = 1$

$bT_{i+1} = eT_i$

$bTs_{i+1} = eT_{i+1}$

$eTs_{i+1} \ge eT_j + M_{i+1}$

$eTs_{i+1} \ge bTs_{i+1}$

$bT_{i+2} = eTs_{i+1}$

$tT_i^f = \delta_i^f \times execT_i^f$
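
To make the chaining of these constraints concrete, the sketch below computes the earliest dates they allow for the configuration of Figure 8. It reads $Ts_{i+1}$ as the waiting period that precedes the reception, which is an interpretation of the constraints, and every numeric value (frequencies in MHz, execution times, message time) is hypothetical. The function names are illustrative.

```python
def task_end(begin, fractions, exec_times):
    """Return eT = bT + sum_f tT^f, where tT^f = delta^f * execT^f."""
    return begin + sum(fractions[f] * exec_times[f] for f in fractions)


def chain_times(b_ti, t_i, t_i1, e_tj, msg_time):
    """Earliest dates for T_i, T_{i+1}, the waiting period Ts_{i+1} and T_{i+2}.

    t_i and t_i1 are (fractions, exec_times) pairs; all inputs are hypothetical.
    """
    e_ti = task_end(b_ti, *t_i)
    b_ti1 = e_ti                        # T_{i+1} begins when T_i ends
    e_ti1 = task_end(b_ti1, *t_i1)
    b_ts = e_ti1                        # waiting begins when T_{i+1} ends
    e_ts = max(e_tj + msg_time, b_ts)   # smallest value meeting both inequalities
    b_ti2 = e_ts                        # T_{i+2} begins when waiting ends
    return {"eT_i": e_ti, "eT_i+1": e_ti1, "eTs_i+1": e_ts, "bT_i+2": b_ti2}


# (fraction of the task run at each frequency, execution time at each frequency)
task_i = ({2000: 0.5, 1200: 0.5}, {2000: 2.0, 1200: 3.0})
task_i1 = ({2000: 1.0, 1200: 0.0}, {2000: 1.0, 1200: 1.5})
dates = chain_times(0.0, task_i, task_i1, e_tj=4.0, msg_time=0.5)
```

In this example the message arrives after $T_{i+1}$ has finished, so the waiting period is non-empty and $T_{i+2}$ starts at the arrival date.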


6.4 Workload approach 

6.4.1 Additional variable 

$\gamma_i$ : A binary variable used to say if a workload duration is 0 or not 
$M$ : A large constant 

$bW_i \ge bT_j$

$eW_i \le eT_j$

$tT_j^f = \sum_{W_i,\ T_j \in W_i} tw_i^f$

$dW_i = \sum_f tw_i^f$


/ 












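Following (16) and (17), and writing $b_i^f$ for the binary variable that indicates whether $tw_i^f$ is non-zero, we express the threshold definition as: 

$tw_i^f \ge Th \times b_i^f$

$tw_i^f \le M \times b_i^f$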

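Using (18), (19), (20) and (21), with $\gamma_i$ as the binary variable, we express the definition of $dW_i$ as: 

$eW_i - bW_i \le M \times \gamma_i$

$bW_i - eW_i \le M \times (1 - \gamma_i)$,  $\gamma_i \in \{0, 1\}$

$eW_i - bW_i \le dW_i \le M \times \gamma_i$

$eW_i - bW_i + dW_i \le 2 \times (eW_i - bW_i) + M \times (1 - \gamma_i)$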


6.4.2 Proof of workload duration 

We want to prove that if two workloads $W$ and $W'$ are possible, but they violate the precedence constraint 
between the tasks, then the duration of at least one of them is zero. We provide the proof for workloads 
with a cardinality equal to 2 since the proof remains the same for larger workloads. 

Let $W = (T_i, T_j)$ and $W' = (T'_i, T'_j)$ such that $T_i$ precedes $T'_i$ and $T'_j$ precedes $T_j$. We want to prove that 
$dW = 0$ or $dW' = 0$. 

Lemma 1. Let $W = (T_i, T_j)$ and $W' = (T'_i, T'_j)$. If $bT'_i \ge eT_i$ and $bT_j \ge eT'_j$, then $dW = 0$ or $dW' = 0$. 

Proof. Let us prove Lemma 1 by contradiction. Let us assume that $dW \ne 0$ and $dW' \ne 0$. 
From definition (10): $eW > bW$ and $eW' > bW'$. 

From the constraints relating a workload to its tasks (Section 6.4.1): 

$bW \ge bT_i$,  $bW \ge bT_j$,  $eW \le eT_i$,  $eW \le eT_j$

and    (26)

$bW' \ge bT'_i$,  $bW' \ge bT'_j$,  $eW' \le eT'_i$,  $eW' \le eT'_j$

But $bT'_i \ge eT_i$ and $bT_j \ge eT'_j$, thus: 

$bW \ge bT_j \ge eT'_j \ge eW'$    (27)

$bW' \ge bT'_i \ge eT_i \ge eW$    (28)

If we combine $eW > bW$ with (27) and (28): 

$bW' \ge bT'_i \ge eT_i \ge eW > bW \ge eW'$

Thus $bW' > eW'$, which by definition (10) implies that $dW' = 0$, which leads to a contradiction. □ 
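
The lemma can also be illustrated numerically. The sketch below draws random task dates, keeps the draws that satisfy the hypothesis of Lemma 1, and checks that the largest duration allowed for $W$ or for $W'$ is zero. It is a sanity check on sampled instances, not a substitute for the proof; the sampling ranges and function names are arbitrary assumptions.

```python
import random


def max_workload_duration(tasks):
    """Largest dW allowed by bW >= bT and eW <= eT for every (bT, eT) in tasks."""
    latest_begin = max(b for b, _ in tasks)
    earliest_end = min(e for _, e in tasks)
    return max(0.0, earliest_end - latest_begin)


def random_task(rng):
    """Draw a hypothetical task as a (begin, end) pair with positive length."""
    begin = rng.uniform(0.0, 10.0)
    return (begin, begin + rng.uniform(0.1, 5.0))


def check_lemma(trials=20000, seed=0):
    """Return how many sampled instances satisfied the hypothesis of Lemma 1."""
    rng = random.Random(seed)
    checked = 0
    for _ in range(trials):
        t_i, t_j, tp_i, tp_j = (random_task(rng) for _ in range(4))
        # Hypothesis: bT'_i >= eT_i and bT_j >= eT'_j
        if tp_i[0] >= t_i[1] and t_j[0] >= tp_j[1]:
            d_w = max_workload_duration([t_i, t_j])
            d_wp = max_workload_duration([tp_i, tp_j])
            assert d_w == 0.0 or d_wp == 0.0
            checked += 1
    return checked


checked = check_lemma()
```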


6.5 Frequency switch approach 

Note that we do not detail how the threshold condition is handled since it is done in the same way as for the 
workloads. 






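6.5.1 Additional variables 

$\epsilon_{jp}^f$ : A binary variable used to say if a workload is executed at a frequency $f$ or not 
$y_{ij}^f$ : The maximum between $bT_i$ and $c_{jp}^f$ 
$w_{ij}^f$ : A variable used to express $y_{ij}^f$. It is equal to 0 if $bT_i$ is the maximum, and $c_{jp}^f - bT_i$ otherwise 
$\alpha_{ij}^f$ : A binary variable used to verify whether $bT_i > c_{jp}^f$ 
$z_{ij}^f$ : The minimum between $eT_i$ and $c_{\{j+1\}p}^{f'}$ 
$g_{ij}^f$ : A variable used to express $z_{ij}^f$. It is equal to 0 if $eT_i$ is the minimum, and $eT_i - c_{\{j+1\}p}^{f'}$ otherwise 
$\beta_{ij}^f$ : A binary variable used to verify whether $eT_i < c_{\{j+1\}p}^{f'}$ 
$\phi_{ij}^f$ : A binary variable used to check if $bT_i - c_{\{j+1\}p}^{f'} > 0$ 
$\psi_{ij}^f$ : A binary variable used to check if $eT_i - c_{jp}^f < 0$ 
$\rho_{ij}^f$ : A binary variable used to check if $\psi_{ij}^f$ or $\phi_{ij}^f$ are true 
$M$ : A large constant 


6.5.2 Constraints 

$c_{\{j+1\}p}^{f'} \ge c_{jp}^f$

$c_{\{j+1\}p}^{f'} - c_{jp}^f \ge Th \times \epsilon_{jp}^f$

$c_{\{j+1\}p}^{f'} - c_{jp}^f \le M \times \epsilon_{jp}^f$

$tT_i^f = \sum_j d_{ij}^f$
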


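Expressing definition (15) as inequalities requires the use of (20) and (21) for the maximum and the minimum 
such that: 

$y_{ij}^f = \max(bT_i, c_{jp}^f) = bT_i + w_{ij}^f$

such that: 

$w_{ij}^f = \begin{cases} 0 & \text{if } bT_i \text{ is the maximum} \\ c_{jp}^f - bT_i & \text{otherwise} \end{cases}$

and: 

$z_{ij}^f = \min(eT_i, c_{\{j+1\}p}^{f'}) = eT_i - g_{ij}^f$

such that: 

$g_{ij}^f = \begin{cases} 0 & \text{if } eT_i \text{ is the minimum} \\ eT_i - c_{\{j+1\}p}^{f'} & \text{otherwise} \end{cases}$
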


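Let $\alpha_{ij}^f$ be the binary variable used for the maximum and $\beta_{ij}^f$ the one used for the minimum. By replacing 
the corresponding variables in (20) and (21), we obtain the following inequalities for the maximum: 

$c_{jp}^f - bT_i \le M \times \alpha_{ij}^f$

$bT_i - c_{jp}^f \le M \times (1 - \alpha_{ij}^f)$,  $\alpha_{ij}^f \in \{0, 1\}$

$c_{jp}^f - bT_i \le w_{ij}^f \le M \times \alpha_{ij}^f$

$c_{jp}^f - bT_i + w_{ij}^f \le 2 \times (c_{jp}^f - bT_i) + M \times (1 - \alpha_{ij}^f)$

and the following for the minimum: 

$eT_i - c_{\{j+1\}p}^{f'} \le M \times \beta_{ij}^f$

$c_{\{j+1\}p}^{f'} - eT_i \le M \times (1 - \beta_{ij}^f)$,  $\beta_{ij}^f \in \{0, 1\}$

$eT_i - c_{\{j+1\}p}^{f'} \le g_{ij}^f \le M \times \beta_{ij}^f$

$eT_i - c_{\{j+1\}p}^{f'} + g_{ij}^f \le 2 \times (eT_i - c_{\{j+1\}p}^{f'}) + M \times (1 - \beta_{ij}^f)$
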


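Finally, using (23), (24) and (25) with the binary variables $\phi_{ij}^f$, $\psi_{ij}^f$ and $\rho_{ij}^f$ as $bin_1$, $bin_2$ and $bin_3$ respectively, 
and using (20) and (21), $d_{ij}^f$ can be expressed from the maximum $y_{ij}^f$ and the minimum $z_{ij}^f$ as: 

$\phi_{ij}^f \le \rho_{ij}^f$

$\psi_{ij}^f \le \rho_{ij}^f$

$\rho_{ij}^f \le \phi_{ij}^f + \psi_{ij}^f$

$z_{ij}^f - y_{ij}^f \le d_{ij}^f \le M \times (1 - \rho_{ij}^f)$

$z_{ij}^f - y_{ij}^f + d_{ij}^f \le 2 \times (z_{ij}^f - y_{ij}^f) + M \times \rho_{ij}^f$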
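
The value that these inequalities are meant to pin down can be computed directly. The sketch below evaluates the time a task spends between two consecutive frequency switches as the minimum of the two end dates minus the maximum of the two begin dates, or 0 when the task lies entirely outside the interval. The pairing of each condition with an interval boundary is an assumption drawn from the shape of the constraints, and the function name and numeric values are hypothetical.

```python
def time_in_interval(b_t, e_t, c_j, c_next):
    """Time spent by a task [b_t, e_t] between switches c_j and c_next."""
    y = max(b_t, c_j)        # later of the task begin and the interval begin
    z = min(e_t, c_next)     # earlier of the task end and the interval end
    phi = b_t - c_next > 0   # the task begins after the interval ends
    psi = e_t - c_j < 0      # the task ends before the interval begins
    rho = phi or psi         # no overlap if either condition holds
    return 0.0 if rho else z - y


# Hypothetical switch dates on one processor and one task running on it
switches = [0.0, 2.0, 3.0, 6.0]
b_t, e_t = 1.0, 4.0
durations = [time_in_interval(b_t, e_t, c, c_next)
             for c, c_next in zip(switches, switches[1:])]
```

Summing the per-interval durations gives back the task length, which mirrors the constraint that the time of a task at a frequency is the sum of its durations over the intervals.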



